BioData Mining — Latest Matching Preprints

1

Identifying genes associated with phenotypes using machine and deep learning

Muneeb, M.; Ascher, D.

2026-03-07 bioinformatics 10.64898/2026.03.05.709665 medRxiv

Top 0.1%

8.2%

Identifying disease-associated genes enables the development of precision medicine and the understanding of biological processes. Genome-wide association studies (GWAS), gene expression data, biological pathway analysis, and protein network analysis are among the techniques used to identify causal genes. We propose a machine-learning (ML) and deep-learning (DL) pipeline to identify genes associated with a phenotype. The proposed pipeline consists of two interrelated processes. The first is classifying people into case/control based on the genotype data. The second is calculating feature importance to identify genes associated with a particular phenotype. We considered 30 phenotypes from the openSNP data for analysis, 21 ML algorithms, and 80 DL algorithms and variants. The best-performing ML and DL models, evaluated by the area under the curve (AUC), F1 score, and Matthews correlation coefficient (MCC), were used to identify important single-nucleotide polymorphisms (SNPs), and the identified SNPs were compared with the phenotype-associated SNPs from the GWAS Catalog. The mean per-phenotype gene identification ratio (GIR) was 0.84. These results suggest that SNPs selected by ML/DL algorithms that maximize classification performance can help prioritise phenotype-associated SNPs and genes, potentially supporting downstream studies aimed at understanding disease mechanisms and identifying candidate therapeutic targets.

2

Improving Causal Gene Identification Using Large Language Models

Ofer, D.; Kaufman, H.

2026-03-10 bioinformatics 10.64898/2026.03.08.710344 medRxiv

Top 0.1%

6.0%

Genome-Wide Association Studies (GWAS) have successfully identified numerous loci associated with complex traits and diseases, yet pinpointing causal genes remains a significant challenge. The reliance on simple proximity-based heuristics is often insufficient due to linkage disequilibrium, gene interactions, and regulatory effects. Recent advancements in Large Language Models (LLMs) have demonstrated potential in automating causal gene identification, but their effectiveness remains limited by knowledge representation and retrieval mechanisms. This study builds on previous research by evaluating LLMs for causal gene identification, with a focus on enhancing performance through Retrieval-Augmented Generation (RAG) and the incorporation of genomic distance information. We replicate prior results using a smaller model, Qwen2.5, and assess predictive accuracy using a benchmark dataset from Open Targets. Performance improved when integrating RAG-based literature retrieval (F1 = 0.795) and gene distance information (F1 = 0.806). However, the combined approach yielded diminishing returns, suggesting interactions between these enhancements. Error analysis revealed that genomic distance features improved predictions by reinforcing established heuristics, while RAG enhanced domain knowledge but occasionally led to semantic biases. These findings highlight the potential of hybrid approaches in leveraging both structured genomic features and unstructured textual data.

3

Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv

Top 0.1%

4.9%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often have characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily for comparison with regard to variable selection capability, and secondarily for survival time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. The performance of some methods was greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
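
For readers unfamiliar with the Benjamini-Hochberg procedure named in this abstract, the following is a minimal sketch of the standard step-up rule and of the false discovery proportion it aims to control. It is not the authors' benchmark code; the function names and the `is_null` ground-truth flags (available only in simulations) are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up rule: return True where a hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def false_discovery_proportion(rejected, is_null):
    """Share of rejections that are true nulls (needs simulated ground truth)."""
    discoveries = sum(rejected)
    if discoveries == 0:
        return 0.0
    false_hits = sum(1 for r, null in zip(rejected, is_null) if r and null)
    return false_hits / discoveries
```

Because the rule is step-up, a large p-value that clears its own threshold also pulls in every smaller p-value, which is one reason the achieved false discovery proportion can fluctuate from dataset to dataset.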

4

Reusing Blood Samples from a Hospital-based Cohort to Apixaban Plasma Concentrations

Murray, K. T.; Fabbri, D. V.; Annis, J. S.; Clark, C. R.; Pulley, J. M.; Brittain, E.; Gailani, D.

2026-04-08 pharmacology and therapeutics 10.64898/2026.04.07.26350322 medRxiv

Top 0.1%

4.3%

In the management of atrial fibrillation, the most frequently prescribed oral anticoagulant is apixaban, given at a fixed dose of 5 mg BID. Apixaban is predominantly metabolized by cytochrome P450 3A4 (CYP3A4) and is also a substrate for the drug efflux transporter P-glycoprotein (P-gp). In nearly 300,000 Medicare patients with AF receiving apixaban, we previously showed that concomitant therapy with drugs that inhibit both CYP3A4 and P-gp, specifically amiodarone or diltiazem, significantly increased serious bleeding that caused hospitalization and/or death. We hypothesized that this adverse effect was mediated by an increase in apixaban plasma concentrations caused by concomitant therapy that reduced drug elimination. Utilizing left-over samples obtained from clinically indicated blood draws that would typically be discarded, the Vanderbilt University Medical Center biobank BioVU contains >353,000 samples linked to de-identified electronic medical records (EMRs), with both DNA and plasma harvested. Of 35 samples drawn from patients taking apixaban 5 mg BID, 5 were identified to be drawn from patients concomitantly taking drugs inhibiting both CYP3A4 and P-gp. Using a chromogenic anti-Xa assay, we found that plasma concentrations of apixaban were significantly higher (347 ± 64 ng/mL; mean ± SEM) for patients receiving concomitant CYP3A4/P-gp-inhibiting drugs compared to those not treated with these drugs (166 ± 67 ng/mL; P=0.025, Mann-Whitney). There were no differences between the 2 patient groups with respect to age, weight, or serum creatinine. The results of this pilot study provide preliminary data to support our hypothesis, and they demonstrate the practicality of obtaining pharmacokinetic data from a large cohort of plasma samples linked to de-identified EMRs. This approach could be used to define the role of apixaban levels in high-risk clinical scenarios and to better understand the relationship between drug levels and bleeding risk.

5

Benchmarking 80 binary phenotypes from the openSNP dataset using deep learning algorithms and polygenic risk score tools

Muneeb, M.; Ascher, D.; Myung, Y.; Feng, S.; Henschel, A.

2026-03-09 bioinformatics 10.64898/2026.03.06.710126 medRxiv

Top 0.1%

4.3%

Genotype-phenotype prediction plays a crucial role in identifying disease-causing single nucleotide polymorphisms and in precision medicine. In this manuscript, we benchmark the performance of various machine/deep learning algorithms and polygenic risk score tools on 80 binary phenotypes extracted from the openSNP dataset. After cleaning and extraction, the genotype data for each phenotype is passed to PLINK for quality control, after which it is transformed separately for each of the considered tools/algorithms. To compute polygenic risk scores, we used the quality control measures for the test data and the genome-wide association studies summary statistic file, along with various combinations of clumping and pruning. For the machine learning algorithms, we used p-value thresholding on the training data to select the single nucleotide polymorphisms, and the resulting data was passed to the algorithm. Our results report the average 5-fold Area Under the Curve (AUC) for 29 machine learning algorithms, 80 deep learning algorithms, and 3 polygenic risk score tools with 675 different clumping and pruning parameters. Machine learning performed best for 44 phenotypes, while polygenic risk score tools performed best for 36 phenotypes. The results give us valuable insights into which techniques tend to perform better for certain phenotypes compared to more traditional polygenic risk score tools.

6

From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics 10.64898/2026.03.21.713397 medRxiv

Top 0.1%

3.9%

Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (p-adj < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on annotation strategy. The fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.

7

Predicting long-term adverse outcomes after neonatal intensive care

Ogretir, M.; Kaipainen, V.; Leskinen, M.; Lahdesmaki, H.; Koskinen, M.

2026-03-31 pediatrics 10.64898/2026.03.26.26348580 medRxiv

Top 0.1%

3.6%

Neonates requiring intensive care are at increased risk for long-term neuropsychiatric disorders. However, clinical adoption of risk prediction models remains limited when their performance lacks adequate interpretability for informed clinical decision-making. Here, we investigated whether longitudinal neonatal electronic health record (EHR) data from the first 90 days of life can support clinically meaningful interpretation of long-term risk signals for major neuropsychiatric diagnoses by age seven. In a retrospective register-based cohort of 17,655 at-risk children from an academic medical center, of whom 8.0% (1,420) received a major neuropsychiatric diagnosis during follow-up, we applied a time-aware transformer model (Self-supervised Transformer for Time-Series; STraTS) and thoroughly evaluated its predictions using three complementary interpretability approaches: perturbation-based variable importance, value-dependent effect analysis, and leave-one-out (LOO) feature attribution. STraTS achieved the highest area under the precision-recall curve (AUPRC 0.171 ± 0.022), compared with Random Forest (0.166 ± 0.008), logistic regression (0.151 ± 0.007), and XGBoost (0.128 ± 0.010). Across interpretability methods, five predictors were consistently identified: birth weight, gender, Apgar score at 1 minute, umbilical serum thyroid stimulating hormone (uS-TSH), and treatment time in hospital. Indicators of early clinical severity, including chromosomal abnormalities and neonatal cerebral-status disturbances, showed the largest risk-increasing effects. Furthermore, the model's learned vector representations of subject-specific EHR sequences formed clinically coherent latent embeddings that reflect population heterogeneity along established perinatal risk dimensions.
These findings demonstrate that combining multiple complementary interpretability methods yields stable, clinically plausible risk signals while revealing limitations that would remain undetected by any single approach, highlighting the importance of careful interpretability analysis of deep learning-based risk predictions.

8

Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv

Top 0.1%

3.6%

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. 
The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.
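
The random-projection testing idea described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' framework: it uses one replicate per direction, a two-sample Kolmogorov-Smirnov test with its large-sample critical value as the assumed per-projection test, and treats the per-direction rejections as independent Bernoulli trials, which is only an approximation because all projections reuse the same data. All function names are hypothetical.

```python
import math
import random


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_rejects(xs, ys, alpha):
    """Reject equality of distributions using the asymptotic critical value."""
    n, m = len(xs), len(ys)
    critical = math.sqrt(-0.5 * math.log(alpha / 2)) * math.sqrt((n + m) / (n * m))
    return ks_statistic(xs, ys) > critical


def random_direction(dim, rng):
    """Draw a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def projection_z(sample_a, sample_b, n_directions=200, alpha=0.05, seed=0):
    """Pool per-direction test outcomes into one standardized statistic."""
    rng = random.Random(seed)
    dim = len(sample_a[0])
    rejections = 0
    for _ in range(n_directions):
        u = random_direction(dim, rng)
        proj_a = [sum(c * w for c, w in zip(row, u)) for row in sample_a]
        proj_b = [sum(c * w for c, w in zip(row, u)) for row in sample_b]
        rejections += ks_rejects(proj_a, proj_b, alpha)
    # Under the null, each outcome is roughly Bernoulli(alpha).
    expected = n_directions * alpha
    sd = math.sqrt(n_directions * alpha * (1 - alpha))
    return (rejections - expected) / sd
```

Note that `alpha` here is the per-projection significance level; the threshold applied to the returned z-value is a separate choice, which mirrors the decoupling of the two levels that the abstract describes.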

9

A Multimodal Clinical Dataset of Early Adversity, Placement History, and Prenatal Exposures in Adopted and Foster Care Children

Sullivan, C. R.; Anderson, S.; Caola, L.; Rawstern, T.; Loleng, J.; Roghair, J.; Dastin-Van Rijn, E.; Gustafson, K.; Randolph, A.

2026-05-29 pediatrics 10.64898/2026.05.27.26354273 medRxiv

Top 0.1%

3.6%

We assembled a multimodal clinical dataset describing demographics, placement history, prenatal substance exposure (PSE), birth characteristics, adverse childhood experiences (ACEs), International Classification of Diseases (ICD) diagnoses, and laboratory results for 3,685+ pediatric patients evaluated between 2014 and 2024 at the University of Minnesota's Adoption Medicine Clinic (AMC). Data were curated from electronic medical records through a combined manual and automated extraction protocol using a standardized operating procedure. The resulting dataset integrates structured EMR fields, including neuropsychological, laboratory, and diagnostic information, with manually extracted fields for ACE scores, PSE history, and placement history. We provide an overview of the population represented and describe the dataset's structure, variable definitions, and validation procedures. This resource enables investigations into how early adversity impacts medical and developmental outcomes, and provides one of the largest standardized clinical placement history, PSE, and ACE datasets in an adoption and foster care pediatric population.

10

Early life factors documented in electronic health records predict recurrent acute otitis media

Hurst, J. H.; Zhao, C.; Raynor, E. M.; Lee, J.; Gitomer, S. A.; Woods, C. W.; Kelly, M. S.; Smith, M. J.; Goldstein, B. A.

2026-03-09 pediatrics 10.64898/2026.03.07.26347843 medRxiv

Top 0.1%

3.3%

Background and Objectives: Recurrent acute otitis media (rAOM; defined as ≥3 AOM episodes in 6 months or ≥4 episodes in 12 months) affects 10-15% of children in the United States and is a leading cause of healthcare utilization and antibiotic prescriptions. Prospective identification of children at risk of rAOM could help target interventions and identify new risk factors to guide preventive approaches. We therefore sought to develop predictive models to identify children at risk of rAOM using electronic health records (EHR) data. Methods: We extracted retrospective EHR data for children who were born at Duke University Health System (DUHS) hospitals between January 1, 2014, and June 30, 2022, and who had at least one AOM episode during the study period. We used LASSO to build predictive models for development of rAOM at each episode and identified factors associated with rAOM. Results: We identified 6,566 children who met the study criteria, including 1,634 (24.8%) who met criteria for rAOM. A model using only data available at the first AOM episode had an area under the curve (AUC) of 0.75 (0.73, 0.77) and an Area Under the Precision Recall Curve (AUPRC) of 0.41 (95% CI 0.37, 0.46), indicating moderate discriminative ability. At the time of the first AOM episode, features associated with subsequent rAOM development included age, number of prior antibiotic prescriptions, and diagnosis of gastroesophageal reflux disease (GERD). Further, children who developed rAOM were more likely to experience treatment failure than children who did not meet rAOM criteria across all episodes. Conclusions: Our findings indicate that clinical exposures and patient characteristics documented in the EHR distinguish children who are at risk of developing rAOM. Such models could be deployed within EHR systems to identify children who would benefit from early evaluation by an otolaryngologist and audiologist.
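
The rAOM definition quoted in this abstract (at least 3 episodes in 6 months or at least 4 in 12 months) can be checked from a list of episode dates along these lines. This is only a sketch: the function name is hypothetical, and approximating 6 and 12 months as 183 and 365 days is an assumption rather than the study's stated rule.

```python
from datetime import date, timedelta


def meets_raom_criteria(
    episode_dates,
    six_months=timedelta(days=183),     # assumed day-count for "6 months"
    twelve_months=timedelta(days=365),  # assumed day-count for "12 months"
):
    """Return True if the episode dates satisfy either rAOM threshold."""
    dates = sorted(episode_dates)

    def has_cluster(count, window):
        # Any run of `count` consecutive episodes spanning at most `window`.
        for i in range(len(dates) - count + 1):
            if dates[i + count - 1] - dates[i] <= window:
                return True
        return False

    return has_cluster(3, six_months) or has_cluster(4, twelve_months)
```

For example, `meets_raom_criteria([date(2020, 1, 5), date(2020, 2, 20), date(2020, 4, 1)])` evaluates the 6-month rule on three closely spaced episodes.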

11

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv

Top 0.1%

3.2%

Objective: SNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. Methods: We benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. Results: Heritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h² and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). Conclusion: SNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input.

12

Development and validation of an XGBoost model with SHAP-based interpretability and a web-based calculator for predicting extrauterine growth restriction in preterm infants

Xu, Z.; Yu, C.-L.; Zhang, J.-X.

2026-04-02 pediatrics 10.64898/2026.04.01.26349838 medRxiv

Top 0.1%

3.1%

Background: Extrauterine growth restriction (EUGR) is a common and clinically significant complication among preterm infants, contributing to adverse neurodevelopmental and metabolic outcomes. Early and individualized risk prediction remains challenging. This study aimed to develop and validate an interpretable machine learning model for early prediction of EUGR using routinely available clinical variables, and to implement a user-friendly web-based calculator for clinical use. Methods: We retrospectively analyzed 1,431 preterm infants admitted within 24 hours after birth to our hospital between May 2020 and March 2025. Infants from the Yangpu campus (n=863) formed the training set, and those from the Huangpu campus (n=568) formed the validation set. Early clinical variables available within 48-72 hours were screened using the Boruta algorithm. Logistic regression, XGBoost, random forest, decision tree, and support vector machine models were developed and compared. Model performance was evaluated using area under the curve (AUC), accuracy, sensitivity, specificity, F1 score, and Brier score. SHapley Additive exPlanations (SHAP) were applied to assess global and individual feature contributions, nonlinear effects, and interactions. A web-based calculator was constructed based on the optimal model. Results: Nine variables were identified as important predictors: birth weight, small for gestational age status, gestational age, breastfeeding, multiple gestation, neonatal respiratory distress syndrome, patent ductus arteriosus, maternal hypertension, and maternal group B Streptococcus infection. Among the five models, XGBoost achieved the best performance in the validation set (AUC 0.922, accuracy 0.849, Brier score 0.108). SHAP analysis showed that low birth weight, small for gestational age, maternal group B Streptococcus infection, and patent ductus arteriosus were major risk factors, while breastfeeding was protective. 
Notable nonlinear and interactive effects were observed, particularly between birth weight and gestational age and between breastfeeding and patent ductus arteriosus. The web-based calculator provides real-time individualized risk estimation and visualized interpretation. Conclusions: An interpretable XGBoost-based model and web calculator were successfully developed and validated for early prediction of EUGR in preterm infants. This tool may support clinicians in identifying high-risk infants and guiding individualized nutritional and clinical management.

13

DxFit: An ensemble method for identifying EHR diagnoses consistent with a molecular finding

Torene, R. I.; Meltz Murphy, K.; Brandt, T.; Retterer, K.

2026-04-28 genomics 10.64898/2026.04.24.720629 medRxiv

Top 0.1%

2.8%

As population DNA sequencing becomes more common, genomic-first approaches are increasingly used to identify individuals with possible rare genetic disorders. To accurately estimate prevalence and penetrance, these studies often confirm manifestation of the disorder using electronic health records (EHRs). Multiple strategies exist to search the EHR for diagnoses of rare disorders; however, each has its limitations. We have developed a portable, ensemble tool, DxFit, that mines EHR data (ICD codes and structured diagnosis descriptions from billing code and problem list tables) for a diagnosis consistent with a given rare genetic disorder. DxFit combines evidence across four strategies: (1) gene name searches in diagnosis descriptions and notes, (2) ICD conversion to Mondo rare disorder ontology to find exact and nearby matches, (3) word embedding similarity searches, and (4) Jaccard similarity matches. DxFit prioritizes the match type and outputs the most confident match for each participant-disorder pair. On a cohort of 350 participants with a known positive result from diagnostic genetic testing for developmental disorders, DxFit had a sensitivity of 88.7% and specificity of 86.2% using default parameters. Adjusting the linguistic scoring thresholds from 0.8 to 0.7 and allowing for synonymous matches yielded a sensitivity of 92.7% and specificity of 84.5%. Partitioning EHR evidence into windows before and after genetic testing demonstrates, as expected, that the overall DxFit rates increase after testing and the match types become more confident. DxFit is available to the public and has extensive customization options to support a wide range of uses.
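
As a rough illustration of the Jaccard similarity strategy and the 0.8 versus 0.7 thresholds mentioned in this abstract, the sketch below compares word sets from a diagnosis description and a disorder name. It is not DxFit's implementation: whitespace tokenization, lowercasing, and the function names are assumptions, and the example strings are made up.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard index of the two texts' lowercase word sets."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_jaccard_match(diagnosis, disorder_name, threshold=0.8):
    """True if the similarity meets the threshold (0.8 mirrors the stated default)."""
    return jaccard_similarity(diagnosis, disorder_name) >= threshold


# Illustrative strings only: three shared words out of four distinct words.
diagnosis = "Rett syndrome congenital variant"
disorder = "Rett syndrome congenital"
strict = is_jaccard_match(diagnosis, disorder, threshold=0.8)
relaxed = is_jaccard_match(diagnosis, disorder, threshold=0.7)
```

Lowering the threshold admits partial matches such as this one, which is consistent with the sensitivity gain and specificity loss reported for the relaxed setting.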

14

Cosine Similarity Conflates Clinically Distinct Cancer Variants: A Case for Typed-Graph Retrieval in Precision Oncology Decision Support

Khan, U. A.

2026-05-11 bioinformatics 10.64898/2026.05.05.723102 medRxiv

Top 0.2%

2.4%

Retrieval-augmented generation (RAG) is increasingly applied to clinical decision support in oncology, where treatment selection depends on identifying a patient's specific somatic variant from an NGS report and matching it to evidence-graded therapy options. The vector retrieval that underlies most RAG systems uses cosine similarity over text embeddings, an architecture optimized for linguistic proximity rather than entity-level identity. We hypothesize that cosine-similarity-based retrieval conflates clinically distinct cancer variants at clinically relevant rates, while a typed-graph approach in which each variant is a discrete node preserves variant-level identity by construction. We evaluated 9 cancer variant pairs known to have differential FDA-approved therapy indications, with variant identity informed by the CIViC clinical variant evidence database and primary clinical literature. Variant pairs included BRAF V600E vs V600K (melanoma), EGFR L858R vs T790M (NSCLC, the canonical sensitivity-vs-resistance pair), EGFR exon 19 deletion vs L858R, KRAS G12C vs G12D (only G12C has FDA-approved targeted therapy), KRAS G12C vs G12V, ERBB2 amplification vs activating mutation, two PIK3CA hotspot pairs, and NTRK1 fusion vs point mutation. We computed pairwise cosine similarity for each variant's text representation across three open-source embedding models (PubMedBERT, MedCPT, BGE-large-en-v1.5) and three text formats (short, medium, long). Across the medium format (gene + variant + tumor type), 100% of clinically distinct variant pairs (9/9) had cosine similarity ≥ 0.95 under both biomedical encoders (PubMedBERT, MedCPT). The general-purpose encoder (BGE-large-en-v1.5) showed lower conflation in the medium format (11%) but rose to 100% with added clinical context. At the more stringent τ = 0.99 (averaged across formats), PubMedBERT conflated 56% of pairs and MedCPT conflated 22%.
The biomedically pre-trained encoders performed worse, not better, than the general-purpose encoder. The typed-graph baseline achieves zero conflation by construction. We discuss the architectural implications: vector retrieval is appropriate for unstructured literature search but introduces unsafe ambiguity when used as the substrate for variant-level reasoning that drives drug-selection decisions. We argue that typed-graph retrieval should be the default architecture for any retrieval-grounded clinical decision support system that recommends targeted therapy.
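
The conflation measure described in this abstract can be sketched as the fraction of variant pairs whose embedding cosine similarity meets a threshold τ. The code below is an illustrative assumption, not the author's evaluation code: the vectors stand in for real embeddings, and the `variant_nodes` set is a hypothetical minimal stand-in for a typed graph in which each (gene, variant) pair is its own key.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def conflation_rate(embedding_pairs, tau=0.95):
    """Fraction of pairs with similarity >= tau, i.e. hard to tell apart."""
    conflated = sum(1 for u, v in embedding_pairs if cosine_similarity(u, v) >= tau)
    return conflated / len(embedding_pairs)


# Hypothetical typed-graph keys: identity is exact equality of (gene, variant),
# so two different variants can never collide regardless of textual similarity.
variant_nodes = {("KRAS", "G12C"), ("KRAS", "G12D")}
graph_conflated = ("KRAS", "G12C") == ("KRAS", "G12D")
```

The contrast is structural: a similarity threshold can admit near neighbours, whereas an exact key lookup either finds the requested variant or finds nothing.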

15

Detecting Manuscripts Related to Computable Phenotypes Using a Transformer-based Language Model

Chae, J.; Heise, D. A.; Connatser, K.; Honerlaw, J.; Maripuri, M.; Ho, Y.-L.; Fontin, F.; Tanukonda, V.; Cho, K.

2026-03-16 bioinformatics 10.64898/2026.03.12.711165 medRxiv

Top 0.2%

2.3%

Objective: The demand for a comprehensive phenomics library, which requires identifying computable phenotype definitions and associated metadata from an ever-expanding biomedical literature, presents a significant, labor-intensive, and unscalable challenge. To address this, we introduce a transformer-based language model specifically designed for identifying biomedical texts containing computable phenotypes and pilot its use in the Centralized Interactive Phenomics Resource (CIPHER) platform. Materials and Methods: We fine-tuned a BioBERT model using a labeled dataset of 396 manuscripts. The model incorporates our novel sliding-window approach to effectively overcome token-length limitations, thereby enabling accurate classification of full-length manuscripts. For scalable deployment and continuous refinement, we developed a cohesive framework that integrates a web-based user interface, a control server, and a classification module. Results: The staged approach for model development yielded a final model with 95% accuracy. The web-based user interface was deployed on the CIPHER platform and enables user feedback for model retraining. Discussion: We developed a model and user interface which are currently in use by data curators to identify computable phenotype definitions from the literature. Conclusion: Through this system, users can submit literature, assess classification results, and provide feedback directly influencing future model training, thereby offering an efficient and adaptive solution for accelerating phenotype-driven literature curation.
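
A sliding-window approach of the kind mentioned in this abstract typically splits a long token sequence into overlapping chunks that each fit the model's input limit, scores every chunk, and aggregates the scores. The sketch below shows one such scheme; the window size, stride, mean aggregation, and the `score_window` callable are all assumptions, since the abstract does not specify how the authors' method works.

```python
def sliding_windows(tokens, window_size=512, stride=256):
    """Split tokens into overlapping windows that together cover the sequence."""
    if window_size <= 0 or stride <= 0 or stride > window_size:
        raise ValueError("need 0 < stride <= window_size")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the document.
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


def classify_document(tokens, score_window, window_size=512, stride=256, threshold=0.5):
    """Average per-window scores (assumed aggregation) and apply a threshold."""
    windows = sliding_windows(tokens, window_size, stride)
    scores = [score_window(window) for window in windows]
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score
```

Here `score_window` would wrap the fine-tuned classifier; other aggregation rules, such as taking the maximum window score, are equally plausible and would favour recall over precision.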

16

ExposoGraph: An Interactive Platform for Carcinogen Bioactivation and Detoxification Pathway Visualization

Pienta, K.; Kazi, J. U.

2026-03-24 bioinformatics 10.64898/2026.03.22.713456 medRxiv

Top 0.2%

2.1%

Background: Despite extensive cataloging of carcinogenic exposures by the International Agency for Research on Cancer (IARC) and pharmacogenomic variation by resources such as PharmVar and CPIC, few platforms unify exposure, metabolic activation and detoxification, DNA damage, and genetic annotation within a single interactive visualization framework. This gap limits systematic evaluation of gene-environment interactions in cancer risk assessment. Methods: We developed the Carcino-Genomic Knowledge Graph, ExposoGraph, an interactive knowledge-graph platform for carcinogen metabolism and DNA damage pathways. The reference graph integrates curated data and annotations from IARC, KEGG, PharmVar, CPIC, CTD, and supporting literature/resources. The current reference graph contains 96 nodes across 5 entity types (Carcinogens, Enzymes, Metabolites, DNA Adducts, and Pathways) and 102 edges across 6 relationship types (activates, detoxifies, transports, forms adduct, repairs, and pathway). Results: The first-generation reference graph captures metabolic activation and detoxification pathways for 9 carcinogen classes spanning 15 index carcinogens. It represents 36 enzymes across Phase I activation (n=14), Phase II conjugation and detoxification (n=14), Phase III transport (n=3), and DNA repair (n=5). Interactive exploration supports carcinogen-class filtering, node- and edge-type filtering, metadata-based search, and detailed hover/detail views with provenance and pharmacogenomic annotations. The androgen branch highlights cross-pathway connectivity by linking androgen metabolism to estrogen quinone formation and DNA adduct generation through CYP19A1-mediated aromatization and downstream catechol estrogen chemistry. In the optional androgen-focused extension, additional receptor, tissue, and variant context further connects this branch to androgen receptor signaling and genotype-specific annotations. Conclusions: ExposoGraph provides a first-generation integrated, interactive framework linking carcinogenic exposures to metabolic fates and genetic modulators. The platform supports hypothesis generation for gene-environment interaction studies and may inform future individualized risk modeling, while remaining a research-use framework rather than a clinically validated risk-assessment tool.
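
As a rough illustration of the typed graph this abstract describes, the sketch below models the five entity types and six relationship types as validated labels and shows node- and edge-type filtering. Only the type names come from the abstract. The class names, the edge direction convention, and the placeholder identifiers (`carcinogen_A`, `enzyme_X`, and so on) are hypothetical and do not reflect ExposoGraph's actual schema or data.

```python
from dataclasses import dataclass
from typing import List

# Entity and relationship type names as listed in the abstract.
NODE_TYPES = {"Carcinogen", "Enzyme", "Metabolite", "DNA Adduct", "Pathway"}
EDGE_TYPES = {"activates", "detoxifies", "transports", "forms adduct", "repairs", "pathway"}


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str

    def __post_init__(self) -> None:
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    relation: str

    def __post_init__(self) -> None:
        if self.relation not in EDGE_TYPES:
            raise ValueError(f"unknown relation: {self.relation}")


def filter_nodes(nodes: List[Node], node_type: str) -> List[Node]:
    """Return only the nodes of one entity type."""
    return [n for n in nodes if n.node_type == node_type]


def filter_edges(edges: List[Edge], relation: str) -> List[Edge]:
    """Return only the edges of one relationship type."""
    return [e for e in edges if e.relation == relation]


# Placeholder data for illustration only; not taken from the reference graph.
nodes = [
    Node("carcinogen_A", "Carcinogen"),
    Node("enzyme_X", "Enzyme"),
    Node("metabolite_M", "Metabolite"),
    Node("adduct_D", "DNA Adduct"),
]
edges = [
    Edge("enzyme_X", "carcinogen_A", "activates"),
    Edge("enzyme_X", "metabolite_M", "detoxifies"),
    Edge("metabolite_M", "adduct_D", "forms adduct"),
]

print([n.node_id for n in filter_nodes(nodes, "Enzyme")])
print(filter_edges(edges, "forms adduct"))
```

Validating types at construction time means a mistyped relation fails immediately instead of silently producing an edge that no filter will ever match.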

17

Incidence and Severity of Carboplatin-Associated Hearing Loss in Children with Cancer Assessed by the SIOP 2012 Ototoxicity Criteria

Chawla, A.; Carter, S.; Wood, A.; Staffieri, S.; Dodgshun, A.; Eisenstat, D.; Sullivan, M.

2026-05-30 pediatrics 10.64898/2026.05.21.26353442 medRxiv

Top 0.2%

2.1%

Background: Platinum-based chemotherapy is known to cause severe and debilitating hearing loss, but unlike cisplatin, the true incidence of carboplatin-induced hearing loss remains unclear. We evaluated functional hearing outcomes in children receiving carboplatin to determine the incidence and severity of ototoxicity. Procedure: We identified a large cohort of children with cancer treated with carboplatin and graded their audiograms using the SIOP ototoxicity scale. Patients with inadequate audiological follow-up, prior hearing loss, or exposure to cisplatin were excluded. Fisher's exact test, logistic regression, and ROC analyses were performed to investigate associations of demographic, treatment, and exposure-related risk factors with incidence of hearing loss. Results: 200 patients were included, all of whom had been treated with carboplatin. Only nine (4.5%) patients developed clinically significant hearing loss (SIOP grade ≥2). Younger age at first exposure to carboplatin was the only significant predictor of hearing loss (OR = 0.7888, p=0.0241). Age ≤28 months was significantly associated with hearing loss (OR 12.37, p=0.0042). No other risk factors or exposures were statistically significant. Conclusions: Clinically significant carboplatin-associated hearing loss was uncommon (incidence 4.5%). We show that young age is the single most important risk factor for hearing loss; of nine children who developed hearing loss, eight were aged ≤28 months. Children at or below this age have twelve-fold higher odds of developing hearing loss compared to those above this age (OR 12.37). These findings will allow physicians to provide more appropriate counselling to families regarding ototoxic risk and support intensified hearing surveillance in young children.

18

Evaluating Large Language Models for Transparent Quality-of-Care Measurement in Children with ADHD

Bannett, Y.; Pillai, M.; Huang, T.; Luo, I.; Gunturkun, F.; Hernandez-Boussard, T.

2026-04-17 pediatrics 10.64898/2026.04.12.26350732 medRxiv

Top 0.2%

2.1%

Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement. Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review. Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020 and 2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations. Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence. Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework). Results: All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability. LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit. Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.
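
The reported F1-scores can be cross-checked against the reported sensitivity and PPV, since F1 is the harmonic mean of the two. The short sketch below does this using only the figures quoted in the abstract. The function name `f1_from_sensitivity_ppv` is a hypothetical helper, and the check assumes the rounded two-decimal inputs are close enough to the unrounded values.

```python
def f1_from_sensitivity_ppv(sensitivity: float, ppv: float) -> float:
    """F1 as the harmonic mean of sensitivity (recall) and PPV (precision)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


# (sensitivity, PPV, reported F1) as quoted in the abstract.
reported = {
    "Claude-3.5": (0.89, 0.95, 0.92),
    "LLaMA-3.3-70B": (0.91, 0.89, 0.90),
    "GPT-4o": (0.82, 0.97, 0.89),
}

for model, (sens, ppv, f1_reported) in reported.items():
    f1 = f1_from_sensitivity_ppv(sens, ppv)
    print(f"{model}: computed F1 = {f1:.2f}, reported F1 = {f1_reported:.2f}")

# Share of patients with a documented PTBM recommendation (143 of 542).
print(f"{143 / 542:.1%}")
```

All three computed values round to the reported F1-scores, and 143/542 rounds to the reported 26.4%.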

19

Vital signs, demographics, and clinical events for low-birth-weight infants from four intensive care units

German Mesner, I.; Lake, D. E.; Kausch, S. L.; Krahn, K. N.; Gummadi, A.; Clark, T. W.; Niestroy, J. C.; Sahni, R.; Vesoulis, Z. A.; Gootenberg, D. B.; Ambalavanan, N.; Travers, C. P.; Fairchild, K. D.; Sullivan, B. A.

2026-04-20 pediatrics 10.64898/2026.04.15.26350178 medRxiv

Top 0.2%

2.1%

Premature very low birth weight (VLBW) infants have high rates of mortality and morbidity from sepsis, necrotizing enterocolitis, and respiratory failure requiring intubation and mechanical ventilation. Earlier detection of cardiorespiratory deterioration using vital signs from continuous physiological monitoring may lead to more timely interventions and improved outcomes. To further this research area, we present PreMo, a publicly available dataset of continuous heart rate and oxygen saturation, demographics, clinical events, and outcomes for 3,829 VLBW patients from four Neonatal Intensive Care Units (NICUs) in the United States. The PreMo dataset consists of a collection of parquet files, RO-Crate metadata, and sample usage code scripts hosted on the University of Virginia LibraData Dataverse website.

20

Structured large language model extraction of clinical factors from electronic health record text supports scalable psychiatric severity prediction

Stephenson, C.; Camassa, A.; Wagner, M.; Shirazi, A. H.; Alavi, N.; Omrani, M.

2026-05-13 psychiatry and clinical psychology 10.64898/2026.05.11.26352839 medRxiv

Top 0.2%

2.1%

Background: Mental health systems face escalating demand that exceeds clinician capacity, making accurate severity-based triage a critical bottleneck. Severity assessment guides treatment intensity, resource allocation, and risk management, yet most clinically relevant information remains embedded in unstructured electronic health record (EHR) narratives, limiting its utility for scalable decision support. Objectives: This study evaluates whether a single large language model (LLM) can autonomously extract clinical factors from psychiatric EHR narratives, derive predictive weights from those factors, and use the resulting structured representation to predict clinician-implied severity at scale. Methods: From a Mayo Clinic repository of more than 2.7 million encounters, 15,000 de-identified psychiatric notes were sampled into a 5,000-patient discovery cohort and a 10,000-patient replication cohort. The same LLM (Llama 3 8B Instruct) extracted 17 background clinical factors and 3 treatment-action factors from each note. Severity reference labels were derived from the treatment-action factors using pre-specified clinical criteria. The LLM independently derived two factor-weight dictionaries from the discovery cohort: one capturing risk-oriented predictors of severe presentations and one capturing protective predictors. Five weighting conditions were then evaluated against the severity labels: the two LLM-derived dictionaries, two controls (LLM-derived variables with randomized weights; clinically irrelevant variables with arbitrary weights), and an unweighted zero-shot baseline. Performance was assessed across 928 valid iterations in the replication cohort. Results: LLM-derived structured conditions significantly outperformed all controls and the baseline, with statistically equivalent performance between the two structured conditions. Improvements in precision and recall were balanced, indicating gains in discriminative capacity rather than threshold shifts. The variables and weights the LLM derived as predictors of severe presentations aligned closely with established clinical determinants of psychiatric severity. Conclusion: A single LLM can derive clinically meaningful factor weights from unstructured EHR narratives and use them to predict psychiatric severity at scale, supporting a viable path toward interpretable, scalable triage in resource-constrained mental health systems.
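
To make the notion of a factor-weight dictionary concrete, the sketch below scores one note by summing the weights of the factors extracted from it and comparing the total to a cutoff. The abstract does not say how weights are combined or thresholded, so the additive rule, the threshold value, the placeholder factor names, and the function names `weighted_score` and `is_severe` are all assumptions made for illustration, not the study's method.

```python
from typing import Dict, Set


def weighted_score(present_factors: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the factors found in a note (additive rule is an assumption)."""
    return sum(w for factor, w in weights.items() if factor in present_factors)


def is_severe(score: float, threshold: float) -> bool:
    """Flag a note as severe when its score reaches the threshold (hypothetical cutoff)."""
    return score >= threshold


# Hypothetical risk-oriented dictionary with placeholder factor names and weights.
risk_weights = {"factor_a": 0.5, "factor_b": 0.3, "factor_c": 0.25}

# Factors assumed to have been extracted from a single note.
note_factors = {"factor_a", "factor_c"}

score = weighted_score(note_factors, risk_weights)
print(score, is_severe(score, threshold=0.7))
```

Under this reading, the randomized-weight control would reuse the same factor names with shuffled or random weights, which isolates the contribution of the weights from that of the extracted variables.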